Standardized Test Analysis


Part 1


Problem Statement

The state of California has many school districts, and their ACT and SAT performance varies widely. To improve performance on these standardized tests, it is often not enough to simply increase funding and resources for poorer-performing districts without first identifying the underlying reasons behind the poor results.

This project aims to analyse student performance on the SAT and ACT tests with regard to a number of socio-economic indicators for each district. This helps to identify underlying factors which contribute to poor performance, so that the state can take a more targeted approach in allocating funds and resources, as well as in recommending interventions in areas of concern.

Contents:

Background

The California Education System

Despite being the technological powerhouse of the United States and the 5th highest state in terms of GDP per capita, California seems to underperform when it comes to the quality of pre-tertiary education. It ranks 40th in the quality of pre-K-12 education (Source) and 41st in the quality of public schools (Source). This is not for lack of education resources either, as the state ranks 20th in terms of education spending (Source). This suggests that education spending is being used inefficiently, and that there is an urgent need for a more targeted approach to where and how the money is spent.

Education Quality vs Spending

Source

A little on SAT and ACT

The SAT and ACT are standardized tests that many colleges and universities in the United States require for their admissions process. These scores are used along with other materials such as grade point average (GPA) and essay responses to determine whether or not a potential student will be accepted to the university.

The SAT has two sections: Evidence-Based Reading and Writing, and Math (source). The ACT has 4 sections: English, Mathematics, Reading, and Science, with an additional optional writing section (source). The two tests have different score ranges, which you can read more about on their websites or in additional outside sources.

Since the 1940s, an increasing number of colleges have been using scores from students' performances on tests like the SAT and the ACT as a measure of college readiness and aptitude (source). Although the Covid-19 pandemic prompted some US universities to make the ACT or SAT non-mandatory for the 2020 admission cycle, the tests are making a comeback as normality resumes.

Going to college is still an important engine for social mobility in the US, and going to a university is still correlated with higher earnings (Source). Given this, the ACT and SAT are still important benchmarks for the education system.

Datasets

To fulfill the objective of our analysis, we will use the following datasets:

Datasets provided as part of project

Datasets from other sources

Useful Functions

  1. Manually calculate mean:
  1. Manually calculate standard deviation:

    The formula for standard deviation is below:

    $$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2}$$

    Where $x_i$ represents each value in the dataset, $\mu$ represents the mean of all values in the dataset and $n$ represents the number of values in the dataset.

  1. Data cleaning function:

    A function that takes in a string that is a number and a percent symbol (ex. '50%', '30.5%', etc.) and converts this to a float that is the decimal approximation of the percent. For example, inputting '50%' in your function should return 0.5, '30.5%' should return 0.305, etc. Make sure to test your function to make sure it works!
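The three helper functions described above could be sketched as follows. This is a minimal illustration, not necessarily the notebook's original code; the function names `calculate_mean`, `calculate_std` and `percent_to_float` are assumptions made for this example.

```python
import math


def calculate_mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def calculate_std(values):
    """Return the population standard deviation (divides by n, as in the formula above)."""
    mu = calculate_mean(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))


def percent_to_float(percent_string):
    """Convert a string such as '30.5%' into its decimal form, e.g. 0.305."""
    return float(percent_string.strip().rstrip("%")) / 100
```

Note that `calculate_std` divides by $n$ rather than $n - 1$, so it matches `numpy.std` with its default settings rather than the default of `pandas.Series.std`.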


Part 2


All libraries used to be added here

Data Import and Cleaning

For provided datasets (ACT and SAT datasets)

Display the provided data

Missing Values

From the dataframes, we can see that some values are displayed as NaN, some (such as those in "NumTstTakr") are "0", while others are displayed as *.

To summarise, as we are interested in district level analysis, we will group the data by education district. As such, cleaning up of missing values may not be of significant importance as we will aggregate the values by district. For example, we may sum up the total students enrolled within a district or take the mean of test scores for the available scores in the district.

Nonetheless, we will first visualize the missing values.

Data type conversion

Observations from missing data

For now, we will convert the data types from object to string/float so that we can aggregate the data by district.
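One possible way to do this conversion is with `pd.to_numeric` and `errors="coerce"`, which turns non-numeric placeholders such as `*` into NaN. The sketch below uses a small made-up frame; the column name `AvgScrRead` and the values are assumptions for illustration, not taken from the provided files.

```python
import pandas as pd

# Toy stand-in for a few rows of the ACT file (values are made up;
# "AvgScrRead" is an assumed column name).
raw = pd.DataFrame({
    "NumTstTakr": ["120", "0", "15"],
    "AvgScrRead": ["22", "*", None],
})

numeric_cols = ["NumTstTakr", "AvgScrRead"]
cleaned = raw.copy()
for col in numeric_cols:
    # Anything that cannot be parsed as a number (such as "*") becomes NaN.
    cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")
```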

Get district level data

We will extract the district level data, which is indicated by "D" in the "RType" column.
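A boolean mask on "RType" is one straightforward way to keep only the district rows. The frame below is a made-up example; `DName` is an assumed column name used only for illustration.

```python
import pandas as pd

# Toy data: record types other than "D" stand in for non-district rows.
act = pd.DataFrame({
    "RType": ["S", "D", "C", "D"],
    "DName": ["District A", "District A", None, "District B"],  # assumed column
})

# Keep only district-level records and renumber the rows.
act_district = act[act["RType"] == "D"].reset_index(drop=True)
```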

Adding additional columns (while dealing with missing data)

Next, we will add additional information to the dataframe, while taking note of the following conditions due to missing values:

For ACT

For SAT

Merge ACT and SAT

Next, we will merge the ACT and SAT datasets to get a dataset with both test data
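As a rough sketch of this step, an outer merge keeps districts that appear in only one of the two datasets, and `indicator=True` records where each row came from. The choice of an outer join and of `CDCode` as the key are assumptions for this example, and the data is made up.

```python
import pandas as pd

act_district = pd.DataFrame({
    "CDCode": [1, 2, 3],
    "act_participation_rate": [0.2, 0.1, 0.3],
})
sat_district = pd.DataFrame({
    "CDCode": [2, 3, 4, 5],
    "sat_participation_rate": [0.4, 0.3, 0.5, 0.2],
})

# Outer join: districts present in either dataset are kept.
merged = act_district.merge(sat_district, on="CDCode", how="outer", indicator=True)

# "_merge" is "both", "left_only" (ACT only) or "right_only" (SAT only).
source_counts = merged["_merge"].value_counts()
```

Comparing the `left_only` and `right_only` counts gives the difference in district coverage between the two tests.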

It can be seen that there are 17 more districts with SAT test than ACT test.

ACT test also has more missing values than SAT test.

Let us see how many records there are with missing scores/benchmarks for both tests.
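A count like this can be obtained by combining two `isna()` masks. The sketch below uses made-up data and picks one benchmark column per test from the data dictionary; which columns the notebook actually checked is an assumption.

```python
import numpy as np
import pandas as pd

tests = pd.DataFrame({
    "district_name": ["A", "B", "C", "D"],
    "act_percentage_above_average_score": [0.5, np.nan, np.nan, 0.3],
    "sat_percentage_both_benchmark": [0.6, np.nan, 0.4, np.nan],
})

# True only where the benchmark is missing for both tests.
missing_both = (
    tests["act_percentage_above_average_score"].isna()
    & tests["sat_percentage_both_benchmark"].isna()
)
n_missing_both = int(missing_both.sum())
```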

There are 119 districts with missing score/benchmark values for both tests. For now, we will not remove these as we can still get the participation rate from these records.

The datatypes for the different variables also look to be what we wanted them to be.

Other data sources (district level socio-economic-demographic data)

We will now attempt to process additional datasets about the socio-economic-demographic condition of the districts. This is so that we can draw deeper insights regarding what might have resulted in the poor results in some districts.

Import datasets

Add additional columns

We will add just two more columns for percentage of students who went to charter school and non-charter school.

Obtain subset of dataframes

As both dataframes could contain information in the same categories, we will extract a subset of features from both dataframes before merging them.

Merging dataframes

We will then merge the subsets of both dataframes, making use of NCID_ID as key.

Data type conversion

For consistency purposes, we will convert all percentages to decimal approximation of the percent

Add additional column

We will add an additional column to indicate the majority race of the district.
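One way to derive such a column is `idxmax(axis=1)` over the race percentage columns, which returns the name of the column with the largest share in each row. This sketch assumes that "majority" means the largest group (a plurality) and uses only four of the race columns with made-up values.

```python
import pandas as pd

race_cols = ["white", "black", "hispanic_or_latino", "asian"]
demo = pd.DataFrame({
    "white": [0.6, 0.2, 0.3],
    "black": [0.1, 0.1, 0.1],
    "hispanic_or_latino": [0.2, 0.6, 0.2],
    "asian": [0.1, 0.1, 0.4],
})

# Name of the race column with the largest share in each district.
demo["majority_race"] = demo[race_cols].idxmax(axis=1)
# Boolean flag here; the notebook may have stored this differently.
demo["white_majority"] = demo["majority_race"] == "white"
```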

Merge all dataframes into one single dataframe

After merging the dataframes with the test results and the dataframes with the demographic information, we have extracted 475 school districts with 60 columns each.

Although not all rows have complete information, we will keep all rows for the analysis as at least some data from each row can be useful for our exploratory data analysis.

Data Dictionary

Now that we've fixed our data, and given it appropriate names, let's create a data dictionary.

A data dictionary provides a quick overview of features/variables/columns, alongside data types and descriptions. The more descriptive you can be, the more useful this document is.

Example of a Fictional Data Dictionary Entry:

| Feature | Type | Dataset | Description |
|---|---|---|---|
| county_pop | integer | 2010 census | The population of the county (units in thousands, where 2.5 represents 2500 people). |
| per_poverty | float | 2010 census | The percent of the county over the age of 18 living below the 200% of official US poverty rate (units percent to two decimal places 98.10 means 98.1%) |

Here's a quick link to a short guide for formatting markdown in Jupyter notebooks.

Provided is the skeleton for formatting a markdown table, with columns headers that will help you create a data dictionary to quickly summarize your data, as well as some examples. This would be a great thing to copy and paste into your custom README for this project.

Note: if you are unsure of what a feature is, check the source of the data! This can be found in the README.

| Feature | Type | Dataset | Description |
|---|---|---|---|
| NCES_ID | int64 | california_school_district_info.csv | School district identifier for National Center for Education Statistics (NCES) |
| CDCode | int64 | california_school_district_info.csv | Official school district identifier |
| county_name | object | california_school_district_info.csv | County name |
| district_name | object | california_school_district_info.csv | School district name |
| district_type | object | california_school_district_info.csv | Type of school district |
| urban_locale | object | california_school_district_info.csv | Urban locale of school district (city, suburb, fringe, rural, etc) |
| total_enrolment | int64 | california_school_district_info.csv | Total student enrolment |
| charter_school_percentage | float64 | california_school_district_info.csv | Percentage of students in charter school |
| non_charter_school_percentage | float64 | california_school_district_info.csv | Percentage of students in non-charter school (Public) |
| homeless_student_percentage | float64 | california_school_district_info.csv | Percentage of students who are homeless |
| migrant_student_percentage | float64 | california_school_district_info.csv | Percentage of students who are migrants |
| dropout_percentage | float64 | california_school_district_info.csv | Percentage of students who dropped out |
| suspension_percentage | float64 | california_school_district_info.csv | Percentage of students suspended |
| total_population | float64 | california_school_district_NCES_info.csv | Total population of district |
| median_household_income | float64 | california_school_district_NCES_info.csv | Median household income of district |
| total_household | float64 | california_school_district_NCES_info.csv | Total number of households in district |
| white | float64 | california_school_district_NCES_info.csv | Percentage of district population who are white |
| black | float64 | california_school_district_NCES_info.csv | Percentage of district population who are black |
| hispanic_or_latino | float64 | california_school_district_NCES_info.csv | Percentage of district population who are Hispanic or Latino |
| asian | float64 | california_school_district_NCES_info.csv | Percentage of district population who are Asian |
| american_indian/alaskan_native | float64 | california_school_district_NCES_info.csv | Percentage of district population who are American Indian/Alaskan Native |
| hawaiian_and_other_pacific_islander | float64 | california_school_district_NCES_info.csv | Percentage of district population who are Hawaiian and other Pacific Islanders |
| some_other_race_alone | float64 | california_school_district_NCES_info.csv | Percentage of district population who are from other races |
| two_or_more_races | float64 | california_school_district_NCES_info.csv | Percentage of district population who are of two or more races |
| housing_structure_built_2000_and_after | float64 | california_school_district_NCES_info.csv | Percentage of houses built 2000 and after |
| housing_structure_built_1970-1999 | float64 | california_school_district_NCES_info.csv | Percentage of houses built from 1970 to 1999 |
| housing_structure_built_before_1970 | float64 | california_school_district_NCES_info.csv | Percentage of houses built before 1970 |
| household_with_broadband_internet | float64 | california_school_district_NCES_info.csv | Percentage of households with broadband internet |
| housing_structure_type_house | float64 | california_school_district_NCES_info.csv | Percentage of district population who live in houses |
| housing_structure_type_apartment | float64 | california_school_district_NCES_info.csv | Percentage of district population who live in apartments |
| speak_english_only_children | float64 | california_school_district_NCES_info.csv | Percentage of district population who speak English only |
| under18_with_disability | float64 | california_school_district_NCES_info.csv | Percentage of students with disability |
| under18_with_health_insurance | float64 | california_school_district_NCES_info.csv | Percentage of students with health insurance coverage |
| family_income_below_poverty | float64 | california_school_district_NCES_info.csv | Percentage of students with family income below poverty level |
| married_couple_household | float64 | california_school_district_NCES_info.csv | Percentage of students who are from married couple households |
| cohabitating_couple_household | float64 | california_school_district_NCES_info.csv | Percentage of students who are from cohabitating couple households |
| female_householder_household | float64 | california_school_district_NCES_info.csv | Percentage of students who are from households with female householder only |
| male_householder_household | float64 | california_school_district_NCES_info.csv | Percentage of students who are from households with male householder only |
| parents_not_in_labor_force | float64 | california_school_district_NCES_info.csv | Percentage of students with parents not in labor force |
| bachelors_or_higher | float64 | california_school_district_NCES_info.csv | Percentage of students with parents who possess at least a bachelor's degree |
| expenditure_per_student | float64 | california_school_district_NCES_info.csv | Expenditure per student of the school district |
| majority_race | object | california_school_district_NCES_info.csv | The majority race of the district |
| white_majority | object | california_school_district_NCES_info.csv | Whether the majority race of the district is white |
| act_enroll | float64 | act_2019_ca.csv | Enrollment of Grade 12 |
| act_num_test_taker | float64 | act_2019_ca.csv | Number of test takers for ACT |
| act_participation_rate | float64 | act_2019_ca.csv | Participation rate of Grade 12 in ACT |
| act_average_reading_score | float64 | act_2019_ca.csv | Average ACT Reading score |
| act_average_english_score | float64 | act_2019_ca.csv | Average ACT English score |
| act_average_math_score | float64 | act_2019_ca.csv | Average ACT Math score |
| act_average_science_score | float64 | act_2019_ca.csv | Average ACT Science score |
| act_num_above_average_score | float64 | act_2019_ca.csv | Number of test takers whose ACT composite scores are greater than or equal to 21 |
| act_percentage_above_average_score | float64 | act_2019_ca.csv | Percent of test takers whose ACT composite scores are greater than or equal to 21 |
| sat_enroll | float64 | sat_2019_ca.csv | Enrollment of Grade 12 and Grade 11 |
| sat_num_test_taker | float64 | sat_2019_ca.csv | Number of test takers for SAT |
| sat_participation_rate | float64 | sat_2019_ca.csv | Participation rate of Grade 12 and 11 in SAT |
| sat_num_erw_benchmark | float64 | sat_2019_ca.csv | The number of students meeting the Evidence-Based Reading & Writing (ERW) benchmark established by the College Board based on the New 2016 SAT test format |
| sat_percentage_erw_benchmark | float64 | sat_2019_ca.csv | The percent of students who met or exceeded the benchmark for the Evidence-Based Reading & Writing (ERW) test |
| sat_num_math_benchmark | float64 | sat_2019_ca.csv | The number of students who met or exceeded the benchmark for the New SAT Math test format |
| sat_percentage_math_benchmark | float64 | sat_2019_ca.csv | The percent of students who met or exceeded the benchmark for the SAT Math test |
| sat_num_both_benchmark | float64 | sat_2019_ca.csv | The total number of students who met the benchmark of both Evidence-Based Reading & Writing (ERW) and Math |
| sat_percentage_both_benchmark | float64 | sat_2019_ca.csv | The percent of students who met the benchmark of both Evidence-Based Reading & Writing (ERW) and Math |

Exploratory Data Analysis

Summary Statistics

Just looking at the ACT and SAT data, we see that:

Standard Deviation

Which California district has the highest and lowest test benchmarks

We will use the percentage of students who meet the benchmark for both the SAT and the ACT.
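A ranking like this can be produced with `nlargest` and `nsmallest`. The sketch below uses made-up districts and the SAT benchmark column from the data dictionary; the same calls would apply to the ACT column.

```python
import pandas as pd

scores = pd.DataFrame({
    "district_name": ["A", "B", "C", "D", "E", "F", "G"],
    "sat_percentage_both_benchmark": [0.1, 0.8, 0.5, 0.9, 0.3, 0.7, 0.2],
})

# Five districts with the highest and lowest benchmark percentages.
top5 = scores.nlargest(5, "sat_percentage_both_benchmark")
bottom5 = scores.nsmallest(5, "sat_percentage_both_benchmark")
```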

Highest
Lowest

We see certain recurring names in top 5 and bottom 5 of both tests.

These are school districts which consistently do well or not so well.

Visualize the Data

Analysis of Dependent Variable - ACT and SAT test results

Histogram

In terms of participation rate, the SAT seems to be the more popular choice amongst students in California, with a median district participation rate of about 0.32 compared to about 0.16 for the ACT. Most districts have an ACT participation rate below 0.3, while a sizeable number of districts have an SAT participation rate above 0.3. This perhaps means that the SAT is a more representative indicator for California students, as it covers a larger share of the cohort.

In terms of the percentage of test takers meeting the benchmark, the two tests have similar distributions: a rather large spread, with an obvious trough (just above 0.4) between two crests.

Correlation between ACT and SAT tests

Comparing the benchmark percentages for the ACT and the SAT, we see a high level of correlation between the two tests. As such, it should not matter much which indicator we use.

The SAT has the advantage of more students in California taking it.

The ACT has the advantage of a composite score, which is the average score across subjects for a district, whereas the SAT benchmark measures the percentage of test takers scoring above a certain level. The composite score may therefore offer a higher correlation with the independent variables.

Correlation between SAT participation rate and benchmark

There does not seem to be a significant correlation between participation rate and benchmark.

Analysis of relationship between independent and dependent variables

We will extract the variables with a correlation greater than or equal to 0.5 with the ACT participation rate and the ACT average composite score.
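A filter of this kind can be sketched by taking one column of the correlation matrix and keeping the entries above the threshold. The example below uses made-up data; the target column name `act_average_composite_score` is hypothetical, and applying the threshold to the absolute value (so that strong negative correlations are kept too) is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "act_average_composite_score": [18, 20, 22, 24, 26],  # hypothetical column name
    "median_household_income": [40, 55, 60, 80, 95],
    "expenditure_per_student": [12, 15, 11, 14, 12],
    "family_income_below_poverty": [0.30, 0.22, 0.20, 0.10, 0.05],
})

target = "act_average_composite_score"
# Pearson correlation of every other column with the target.
corr_with_target = df.corr()[target].drop(target)
# Keep variables whose absolute correlation is at least 0.5.
strong = corr_with_target[corr_with_target.abs() >= 0.5].sort_values(ascending=False)
```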

The above independent variables have at least a 0.5 correlation with act average composite score. Next we will examine the variables more closely.

Questions to ask?

Does money buy results?

Even before deciding how and where it is best to spend the education budget, it is perhaps wise to see if money actually brings better results. Since both the government education budget and parental investment can potentially improve results, we will analyse results with regard to both expenditure per student and median household income of each school district.

Does increasing expenditure per student bring about results?

Expenditure per student refers to district level educational spending per student, and consists of the following areas:

Refer to Source for more details

This is money spent directly to support the educational system. One would expect that investing directly in the education system would bring about better results, but is that the case?

We will see if increasing spending is actually correlated with higher percentage of students scoring above benchmark, which would indicate a higher chance of success in college.

The correlation coefficient of 0.028 indicates almost no linear relationship between expenditure per student and the SAT benchmark result.

However, we do see a concentration of expenditure between 10,000 USD and 20,000 USD per student, but within this range of expenditures, we see a rather large disparity for the SAT score.

We will perform another analysis by grouping the expenditure into different bins.
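Binning of this kind can be done with `pd.cut`. In the sketch below, the bin edges at 10,000 and 20,000 USD are assumptions based on the concentration noted above, and the data is made up; it does not reproduce the notebook's results.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "expenditure_per_student": [8000, 9500, 12000, 15000, 18000, 25000],
    "sat_percentage_both_benchmark": [0.30, 0.34, 0.45, 0.50, 0.40, 0.48],
})

# Assumed bin edges; intervals are closed on the right by default.
bins = [0, 10_000, 20_000, np.inf]
labels = ["below 10k", "10k to 20k", "above 20k"]
df["expenditure_group"] = pd.cut(df["expenditure_per_student"], bins=bins, labels=labels)

# Median benchmark percentage within each expenditure group.
group_median = df.groupby("expenditure_group", observed=True)[
    "sat_percentage_both_benchmark"
].median()
```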

Grouping the expenditure per student into 3 groups, we observe the following:

But still, should we simply increase funding for schools without thinking about how they spend it?

Charter vs Non-charter schools

Another angle from which we may approach the expenditure debate is that of charter vs non-charter schools.

As basic background, charter and non-charter schools are both public schools in that they receive funding from the government. The two differ in that charter schools are run by independent groups, including non-profit groups, and they do not need to adhere to most of the guidelines that apply to government-run non-charter public schools. Essentially, they have more freedom in the way the school is run. However, they also tend to receive less funding from the government than non-charter schools (Source), in part because they are not funded for facilities. This may also mean that facilities costs sometimes encroach on the operating budget, leaving less to spend per student.

Let us first see if school districts with a higher percentage of charter schools indeed spend less on students.

There is a correlation of -0.48, which is not particularly strong. However, considering that we are looking at the correlation across entire districts (each of which consists of both charter and non-charter schools), we will accept this correlation coefficient as sufficient for now.

From the scatter plot, we see that the values for districts with 0% charter school can reach a higher level of expenditure per student and overall we do see a decreasing trend.

Now let us see the ACT performance for districts with different percentages of charter schools.

There appears to be a decreasing trend, which may suggest that districts with a higher percentage of charter schools do in fact perform worse than those with a lower percentage.

However, the correlation coefficient is closer to 0 than the one with expenditure, which suggests that ACT performance depends less on expenditure.

To see if this is indeed the case, let us introduce another metric: ACT score per dollar spent on each student. This focuses on how well charter schools use their money, judged by ACT achievement. If districts with a higher charter school percentage achieve a higher ACT score per dollar of expenditure, it would suggest that teaching quality or other non-expenditure factors are helping charter schools perform well in the ACT. It could then be said that they make more efficient use of expenditure.

This would also suggest that the effect of increasing expenditure may be limited, and that it would probably not be wise to further increase expenditure without improving efficiency.
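The metric described above is a simple ratio, sketched below on made-up data. The column names `act_average_composite_score` and `act_score_per_dollar` are hypothetical, and the numbers are chosen only to show the calculation.

```python
import pandas as pd

df = pd.DataFrame({
    "charter_school_percentage": [0.0, 0.1, 0.4, 0.8],
    "expenditure_per_student": [16000, 14000, 12000, 10000],
    "act_average_composite_score": [22.0, 21.0, 21.0, 20.0],  # hypothetical column
})

# ACT score obtained per dollar of expenditure per student.
df["act_score_per_dollar"] = (
    df["act_average_composite_score"] / df["expenditure_per_student"]
)

# Pearson correlation between charter share and the efficiency metric.
efficiency_corr = df["charter_school_percentage"].corr(df["act_score_per_dollar"])
```

A positive `efficiency_corr` would be read as districts with more charter schools getting more ACT score per dollar, which is the pattern the text goes on to discuss.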

We see that charter schools actually make use of their expenditures more efficiently, when judged in terms of ACT score per dollar spent on student. This shows that simply increasing funding to schools is not a sure way to improve test performance.

Do school districts with higher median household income perform better?

Another financial aspect which may influence test performance is the affluence of a student's family background. A wealthier family can afford to provide more financial support for a student's academic development.

Examining whether this is the case could guide decision makers in better allocating resources. Perhaps more funds should go to supporting lower income families, or perhaps a more long-term approach is to improve the overall financial well-being of the community.

From the earlier analysis, median household income has a moderate positive correlation of 0.65 with SAT benchmark percentage. We will take a closer look at the pattern.

The scatter plot shows a correlation between median household income of a school district and the SAT benchmark percentage. As per previous case, we will visualize the data again after grouping the income into different bins.

Let us look at income in terms of percentiles by grouping the districts by income percentile.
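Percentile-based grouping can be sketched with `pd.qcut`, which splits the data into bins holding roughly equal numbers of districts. The use of four quartile brackets and the labels below are assumptions, and the data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    # Income in thousands of USD (made-up values).
    "median_household_income": [30, 40, 50, 60, 70, 80, 90, 100],
    "sat_percentage_both_benchmark": [0.20, 0.25, 0.30, 0.35, 0.40, 0.50, 0.60, 0.70],
})

# Split districts into four income brackets at the 25th, 50th and 75th percentiles.
df["income_quartile"] = pd.qcut(
    df["median_household_income"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

# Median benchmark percentage within each income bracket.
quartile_median = df.groupby("income_quartile", observed=True)[
    "sat_percentage_both_benchmark"
].median()
```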

Here we are dealt the cold hard truth that money does seem to buy results, to the extent that students from wealthier families do in general perform better than their less wealthy counterparts. The effect is especially obvious for districts above the 75th percentile of median household income, where the 25th percentile of SAT benchmark percentage is almost as good as the 75th percentile for districts one income bracket lower.

In case you are wondering whether schools in districts with wealthier families also spend more on students, the answer is... not really. Quite the opposite: poorer districts can actually spend more in terms of school expenditure per student.

So we can see that a family's financial background does play a large role in students' test performance. This can be a problem, because it further reduces social mobility as the advantage snowballs. It is all the more important that we dedicate resources to creating a level playing field for students!

What other socio-factors might be correlated with both household income and test results?

Furthermore, as we see in many parts of the world, income disparity is often correlated with a series of deeper social divisions and inequality. To improve academic results at its root, we will need to identify these factors so as to provide more targeted recommendations. We shall explore these further.

Do some ethnic groups do better than others?

We will first explore the disparity in test results among ethnic groups.

Which are the better performing ethnic groups in terms of the overall score?

We see a few clear trends:

The trends suggest that there is inequity in educational achievement along ethnic lines.

If we take a look at the previous scatter plot of SAT benchmark against median household income, but with additional info for the majority race, we also see how educational and income inequity along ethnic lines are actually intertwined.

Are Asians better at Math?

Let us sidetrack a little to see if the popular notion that Asians are better at math and science holds true.

Compared with English and science, Asians do seem to excel in math.

Household with Broadband Internet

Another factor that has a positive correlation above 0.5 with the test results is the percentage of households with access to broadband internet. We shall examine this factor more closely.

The distribution is roughly bell-shaped with a left skew: there is a high concentration in the 0.8-0.9 range, while the minimum is in the 0.3-0.4 range. Next we will plot the broadband internet rate against the SAT benchmark percentage.

We see quite a clear positive relationship between the percentage of households with broadband internet and the ACT score of the district. By also visualizing the median household income of the district as the size of the bubble, we see that districts with a lower percentage of households with internet are also typically those with lower median household income.

In the information age, access to internet is almost equivalent to access to information. Raising income of an area is a long term process, but in the short term, we could provide subsidized internet service to lower income households in order to provide access to knowledge which is essential to improving results in standardized tests.

Household type
Parents education level

The education level of parents should be correlated with the median household income and also with broadband access. Let's visualize the broadband access graph again, but with added information on household income and parents' education level.

By visualizing the same graph with additional information, we can see how the 3 variables are interrelated and their relationship with the ACT score.

We could further investigate what parents with a higher education level might do differently in their children's education, perhaps things like better academic and career advice, but that is a topic for another day.

Geographical distribution

We will also attempt to visualize the participation rate and benchmarks in terms of the geographical distribution of school districts, to see if there are obvious geographical patterns which influence the test scores.

We will visualize the geographical distribution of urbanisation in California.

It seems that many of the cities and suburban areas are concentrated along the western coast. In fact, the two coastal clusters are San Francisco and Los Angeles.

Upon examination, these districts are more likely to obtain higher ACT scores. In general, districts within cities and suburbs are more likely to obtain higher ACT scores than those in towns and rural areas, and coastal cities are more likely to obtain higher scores than inland cities.

And of course, there are income disparities between districts as well.

We will also visualize the geographical distribution of SAT participation rate. Similar to the ACT score, the SAT participation rate also tends to be higher in coastal cities.

Conclusions and Recommendations

Key Takeaways and Recommendations

  1. Increasing education expenditure per student may only be useful in improving standardized test scores to an extent. Above 10,000 USD per student, the effect becomes less apparent.

  2. From the experience of charter schools, we see that these schools are able to get better results per dollar spent, which further reinforces the notion that expenditure is limited in effectiveness and that there is a diminishing marginal return on education expenditure.

  3. We see that the median household income of a district is more highly correlated with standardized test results. Help should focus on lower income households in order to create a level playing field for students.

  4. We have also identified a few areas where there is both a disparity in income and a disparity in standardized test results:

    • Ethnic groups: We see that as the percentage of White Americans and especially Asian Americans in a district increases, standardized test results and median income also tend to be higher. The inverse is true for Hispanic/Latino Americans and Black Americans.
      • More help should be focused on Hispanic/Latino as well as Black Americans
      • A separate study could be conducted to investigate what Asian American families are doing right to achieve higher standardized scores
    • Broadband access: Districts with higher broadband access also tend to have higher test results. The authorities could consider providing subsidized broadband services in order to create access to information and knowledge.
    • Household type: Students from married-couple households are more likely to score better in standardized tests than those from cohabiting and single-parent households. More work can be done to find out what students from non-married households need, such as after-school care and tuition.
    • Parents' education: Students with parents who hold at least a bachelor's degree are more likely to do better than those whose parents do not. Similarly, more work should be done to identify the needs of these students, and help could be given through services such as career guidance.
    • Urban locale: We also identified geographical disparity in test results, where districts in coastal cities and suburbs tend to do better than inland towns and rural areas. This finding points to areas where more help can be targeted.

The analysis gave us a glimpse into the socio-economic divisions underlying the disparity in test scores. As university admission is still an important pathway to social mobility, it is crucial that the underlying divisions are addressed, so that standardized tests do not perpetuate the existing divisions in society.